Major frontend refactor #313

ChAoSUnItY · 2025-11-15T19:16:37Z

This patch aims to resolve most of the current preprocessor issues in the parser unit, including:

Eager file inclusion: an #include is always expanded before the enclosing #if/#else file guard has been evaluated, which sometimes produces unexpected results.
Poor error reporting: the current system points to an error by its source index, which is not helpful in practice.
Lack of expansion-only compilation.
Lack of predefined macros, including __LINE__ and __FILE__.

With this approach, we introduce a new phase dedicated to preprocessing (a practice seen in chibicc and similar C compilers), instead of scattering the preprocessing logic across several different places.
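To illustrate the eager-inclusion problem described above, here is a minimal sketch of how a dedicated preprocessing phase could decide whether a directive is live before expanding it. It is not taken from the patch: `cond_stack_t`, `cond_push`, `cond_pop`, and `cond_is_active` are hypothetical names, and the fixed nesting limit is an assumption made for brevity.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_COND_DEPTH 32 /* assumed nesting limit, for illustration only */

/* Hypothetical record of nested #if/#ifdef/#ifndef branches. */
typedef struct {
    bool taken[MAX_COND_DEPTH]; /* was the branch at each level taken? */
    int depth;
} cond_stack_t;

/* Enter a conditional block; returns false if nesting is too deep. */
bool cond_push(cond_stack_t *cs, bool condition)
{
    assert(cs);
    if (cs->depth == MAX_COND_DEPTH)
        return false;
    cs->taken[cs->depth++] = condition;
    return true;
}

/* Leave the innermost conditional block when #endif is seen. */
void cond_pop(cond_stack_t *cs)
{
    assert(cs && cs->depth > 0);
    cs->depth--;
}

/* A directive such as #include is live only if every enclosing branch
 * is taken; an inactive #include must not be expanded at all. */
bool cond_is_active(const cond_stack_t *cs)
{
    assert(cs);
    for (int i = 0; i < cs->depth; i++)
        if (!cs->taken[i])
            return false;
    return true;
}
```

With a check like this, the preprocessor would consult `cond_is_active()` when it reaches an #include and skip the file entirely inside a false branch, rather than loading it first and discarding it later.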

Current status

This draft is still in development. The screenshot below compares the output of cpp -E and out/shecc -E:

[screenshot: comparison]

test.c:

#define B
#define EXP(...) __VA_ARGS__
#include "a.c"

int main()
{
#if defined(B)
    __LINE__;
    __FILE__;
#endif
    EXP(1, +, 2);
    printf("Hello, World!\n");
    return 0;
}

a.c:

#pragma once
int fib(void)
{
    A;
    __FILE__;
    return 0;
}

The main compilation workflow is not yet affected by this patch; this is expected to be addressed in later commits.

Memory overhead

An increase in memory usage is expected, since we now use the token_t struct to track positions, and the whole token stream is generated at once for both the preprocessor and the parser to use. Once the parser has finished, we expect to free all tokens.
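The allocate-everything, free-at-once lifetime described above can be sketched as a small block arena. This is only an illustration under assumptions: the fields of `token_t` and `location_t`, the block size, and the names `tok_arena_t`, `tok_arena_alloc`, and `tok_arena_free` are hypothetical and may not match the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical source position carried by every token. */
typedef struct {
    const char *filename;
    int line;
    int column;
} location_t;

/* Hypothetical token layout; the real token_t may differ. */
typedef struct token {
    int kind;
    location_t location;
    struct token *next;
} token_t;

#define TOKENS_PER_BLOCK 256 /* assumed block size */

/* One fixed-size block of tokens; blocks are chained as the stream grows. */
typedef struct tok_block {
    token_t tokens[TOKENS_PER_BLOCK];
    size_t used;
    struct tok_block *prev;
} tok_block_t;

typedef struct {
    tok_block_t *head;
    size_t total; /* number of tokens handed out so far */
} tok_arena_t;

/* Hand out the next zeroed token slot, adding a block when the current
 * one is full. Returns NULL if the system is out of memory. */
token_t *tok_arena_alloc(tok_arena_t *arena)
{
    assert(arena);
    if (!arena->head || arena->head->used == TOKENS_PER_BLOCK) {
        tok_block_t *block = calloc(1, sizeof(tok_block_t));
        if (!block)
            return NULL;
        block->prev = arena->head;
        arena->head = block;
    }
    arena->total++;
    return &arena->head->tokens[arena->head->used++];
}

/* Release every token in one pass, e.g. once the parser has finished. */
void tok_arena_free(tok_arena_t *arena)
{
    assert(arena);
    while (arena->head) {
        tok_block_t *prev = arena->head->prev;
        free(arena->head);
        arena->head = prev;
    }
    arena->total = 0;
}
```

The trade-off is the one stated above: peak memory grows because no token is released until the end, but individual tokens never need to be freed or tracked.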

Summary by cubic

Introduces a dedicated preprocessing phase and rebuilds token handling to fix include guard behavior, add robust macro expansion, and improve error diagnostics. Adds an -E mode to output preprocessed code comparable to cpp -E.

New Features
- Standalone preprocessor with object/function-like and variadic macros.
- Conditional directives: #if, #ifdef, #ifndef, #elif, #else, #endif.
- Local #include support and #pragma once; parses <...> but system paths aren’t resolved yet.
- Predefined macros: __LINE__ and __FILE__.
- -E expansion-only mode with preprocessed output emitter.
- Better token tracking and caching, with precise file:line:column locations.
- CI job to preprocess and build a stage1 artifact to validate -E output.
- libc: add ctype predicates (isdigit, isalpha, isalnum, isxdigit, isblank).
- libc: add fseek/ftell with lseek support for ARM/RISC-V.

Bug Fixes
- Include expansion now respects guards; files are included after condition evaluation.
- Error messages upgraded from raw source indices to readable diagnostics with caret highlights.
- Corrected #elif evaluation during preprocessing.
- Deferred string/char literal unescaping until after preprocessing; fixes hex and octal escape handling.

Written for commit ee045f8. Summary will update automatically on new commits.

ChAoSUnItY · 2025-11-17T20:15:48Z

The initial results are benchmarked below. The memory overhead and the minor performance overhead are expected outcomes of the algorithms and data structures used in this patch:

Each source file is loaded into memory once (it is later used for error diagnostics and tokenization).
The token stream is then computed (tokenized) and stored in TOKEN_ARENA; this is the major memory overhead expected from this patch.
The computed token stream of each source file is also cached.
Later, in the preprocessor, each token is copied before being preprocessed, so that a cached token stream stays intact if it is used again. This is expected to be another major source of memory overhead.
After preprocessing is done, the compiler passes the whole computed token stream to the parser.

The performance overhead is expected to come from multiple traversals of the token stream: at least three, in the lexer, the preprocessor, and the parser.
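
The copy step in the list above can be sketched as follows. This is a simplified illustration, not the patch's code: the reduced `token_t` layout and the helper names `copy_token_stream` and `free_token_stream` are hypothetical, and plain malloc stands in for whatever allocator the patch uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced, hypothetical token layout: only what the sketch needs. */
typedef struct token {
    int kind;
    struct token *next;
} token_t;

/* Free a stream that was produced by copy_token_stream(). */
void free_token_stream(token_t *tk)
{
    while (tk) {
        token_t *next = tk->next;
        free(tk);
        tk = next;
    }
}

/* Duplicate a cached token list so the preprocessor can splice or rewrite
 * the copy without touching the cached original. Returns NULL for an
 * empty list or on allocation failure. */
token_t *copy_token_stream(const token_t *cached)
{
    token_t head = {0, NULL}; /* dummy node simplifies appending */
    token_t *tail = &head;

    for (const token_t *tk = cached; tk; tk = tk->next) {
        token_t *copy = malloc(sizeof(token_t));
        if (!copy) {
            free_token_stream(head.next);
            return NULL;
        }
        copy->kind = tk->kind;
        copy->next = NULL;
        tail->next = copy;
        tail = copy;
    }
    assert(tail->next == NULL); /* the copy is always properly terminated */
    return head.next;
}
```

Every include of an already cached file pays for one such copy, which is consistent with the second source of memory overhead described above.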

performance benchmark (stage 0)

Before

/bin/time -v out/shecc src/main.c:

        Command being timed: "./out/shecc ./src/main.c"
        User time (seconds): 0.16
        System time (seconds): 0.33
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.50
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 305664
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 76195
        Voluntary context switches: 0
        Involuntary context switches: 2
        Swaps: 0
        File system inputs: 0
        File system outputs: 728
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

After

/bin/time -v out/shecc src/main.c:

        Command being timed: "./out/shecc ./src/main.c"
        User time (seconds): 0.20
        System time (seconds): 0.44
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.66
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 384000
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3
        Minor (reclaiming a frame) page faults: 96004
        Voluntary context switches: 44
        Involuntary context switches: 2
        Swaps: 0
        File system inputs: 1992
        File system outputs: 800
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

performance benchmark (stage 1)

Before

/bin/time -v out/shecc-stage1.elf src/main.c:

        Command being timed: "./out/shecc-stage1.elf ./src/main.c"
        User time (seconds): 1.49
        System time (seconds): 1.23
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.76
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 188772
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 64
        Minor (reclaiming a frame) page faults: 44474
        Voluntary context switches: 158
        Involuntary context switches: 5
        Swaps: 0
        File system inputs: 10880
        File system outputs: 728
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

After

/bin/time -v out/shecc-stage1.elf src/main.c:

        Command being timed: "./out/shecc-stage1.elf ./src/main.c"
        User time (seconds): 1.94
        System time (seconds): 1.50
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.47
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 233612
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 63
        Minor (reclaiming a frame) page faults: 55761
        Voluntary context switches: 50
        Involuntary context switches: 15
        Swaps: 0
        File system inputs: 9120
        File system outputs: 800
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
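
For reference, the relative overhead implied by the figures above can be computed with a small helper; `overhead_percent` is a hypothetical name introduced only for this calculation.

```c
#include <assert.h>

/* Relative increase of `after` over `before`, in percent. */
double overhead_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Applied to the numbers above, stage 0 grows from 305664 to 384000 kbytes of maximum resident set size (about 25.6%) and from 0.50 to 0.66 seconds of wall-clock time (about 32%). Stage 1 grows from 188772 to 233612 kbytes (about 23.8%) and from 2.76 to 3.47 seconds (about 25.7%).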

ChAoSUnItY · 2025-11-17T20:21:41Z

There are still some improvements left to do, e.g.:

Delay the computation of escaped characters in string and character literals, so that the output doesn't have to be converted back again when -E is passed.
Rename functions and structures; some leftover unused structures will also be removed.

ChAoSUnItY · 2025-11-18T10:08:43Z

I seem to be unable to reply to the latest review suggestion, so I'll reply here:

Would it be better to use a global array to store the strings and use token->kind as the index to retrieve the corresponding one?

Perhaps? That approach requires reorganizing the order of the token kinds, and I don't think it's necessary just to boost performance while debugging (I haven't encountered significant performance overhead when calling it).
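
For context, the table-based lookup suggested in the review could look roughly like the sketch below. The enumerators are a hypothetical subset and `token_kind_name` is an assumed function name; the real token kind list is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of token kinds; the real enum is much larger. */
typedef enum {
    T_identifier,
    T_numeric,
    T_open_bracket,
    T_close_bracket,
    T_eof,
    T_KIND_COUNT
} token_kind_t;

/* C99 designated initializers tie each string to its enumerator, so the
 * entries may be listed in any order. */
static const char *const token_kind_names[T_KIND_COUNT] = {
    [T_identifier] = "identifier",
    [T_numeric] = "numeric",
    [T_open_bracket] = "open_bracket",
    [T_close_bracket] = "close_bracket",
    [T_eof] = "eof",
};

/* Return a printable name for a token kind, or "<unknown>" if the kind
 * is out of range or has no entry in the table. */
const char *token_kind_name(token_kind_t kind)
{
    assert((int) kind >= 0);
    if ((size_t) kind >= T_KIND_COUNT || !token_kind_names[kind])
        return "<unknown>";
    return token_kind_names[kind];
}
```

Whether shecc's own C subset accepts designated initializers is not stated in this thread. Without them, the table entries would have to follow the enum order exactly, which is the reordering cost mentioned above.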

ChAoSUnItY · 2025-11-19T05:40:54Z

I'm thinking of an approach to validate our compiler's expansion-only compilation flag (aka -E). The procedure is as follows:

make distclean config out/shecc
./out/shecc -E src/main.c > out/out.c
./out/shecc --no-libc -o out/shecc-stage1.elf out/out.c
If the above procedure succeeds, we may continue to bootstrap the stage 2 executable with the same procedure.

@jserv what do you think?

jserv · 2025-11-19T09:28:30Z

I'm thinking of an approach to validate our compiler's expansion-only compilation flag (aka -E). The procedure is as follows:

make distclean config out/shecc

./out/shecc -E src/main.c > out/out.c

Furthermore, we can adapt the preprocessor implementation from slimcc.

cubic-dev-ai

6 issues found across 19 files

Prompt for AI agents (all 6 issues)


Understand the root cause of the following 6 issues and fix them.


<file name="src/preprocessor.c">

<violation number="1" location="src/preprocessor.c:891">
#ifdef/#ifndef ignore #undef because they only test raw map membership. Once a macro is #undef’d it remains in the hashmap, so these directives still treat it as defined.</violation>
</file>

<file name="src/globals.c">

<violation number="1" location="src/globals.c:1674">
error_at() bounds its diagnostic line scan by src->capacity instead of src->size, so it can walk past the actual source text and read uninitialized memory when printing errors.</violation>

<violation number="2" location="src/globals.c:1700">
error() now calls printf with a %s placeholder but never passes msg, so every diagnostic goes through undefined behavior and the message is lost.</violation>
</file>

<file name=".github/workflows/main.yml">

<violation number="1" location=".github/workflows/main.yml:81">
`make out/shecc` ignores the matrix compiler, so the clang matrix leg still builds with the default host compiler instead of clang.</violation>
</file>

<file name="src/lexer.c">

<violation number="1" location="src/lexer.c:561">
Character literals reset the running column counter, so every subsequent token on that line is reported at the wrong column. Increment the column instead of overwriting it.</violation>
</file>

<file name="src/main.c">

<violation number="1" location="src/main.c:104">
`libc_token_stream` is a cached token list shared via `TOKEN_CACHE`, so linking the user tokens into its tail mutates the cache and causes future includes of `lib/c.c` to re-emit the previous translation unit. Build a fresh concatenated stream or copy the libc tokens instead of modifying the cached list.</violation>
</file>
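
The first issue in the list above (#ifdef/#ifndef ignoring #undef) can be illustrated with a minimal sketch. It uses a small linear table instead of the patch's hashmap, and `macro_t`, `macro_table_t`, `macro_define`, `macro_undef`, and `macro_is_defined` are hypothetical names. The point is only that the "defined" test has to look at the tombstone left by #undef, not just at table membership.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_MACROS 64 /* assumed capacity, for illustration only */

typedef struct {
    const char *name;
    bool is_undefined; /* tombstone set by #undef */
} macro_t;

typedef struct {
    macro_t entries[MAX_MACROS];
    int count;
} macro_table_t;

/* Look up an entry by name, whether or not it has been #undef'd. */
static macro_t *macro_find(macro_table_t *table, const char *name)
{
    for (int i = 0; i < table->count; i++)
        if (!strcmp(table->entries[i].name, name))
            return &table->entries[i];
    return NULL;
}

/* #define: add a new entry or revive one that was #undef'd. */
bool macro_define(macro_table_t *table, const char *name)
{
    assert(table && name);
    macro_t *m = macro_find(table, name);
    if (!m) {
        if (table->count == MAX_MACROS)
            return false;
        m = &table->entries[table->count++];
        m->name = name;
    }
    m->is_undefined = false;
    return true;
}

/* #undef: the entry stays in the table and is only marked. */
void macro_undef(macro_table_t *table, const char *name)
{
    assert(table && name);
    macro_t *m = macro_find(table, name);
    if (m)
        m->is_undefined = true;
}

/* What #ifdef, #ifndef, and defined() should consult. */
bool macro_is_defined(macro_table_t *table, const char *name)
{
    assert(table && name);
    macro_t *m = macro_find(table, name);
    return m && !m->is_undefined;
}
```

An alternative would be to delete the entry from the map on #undef, in which case a plain membership test would be sufficient.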

ChAoSUnItY · 2025-11-23T17:54:04Z

#221 is related to this patch, as this patch also fixes the error diagnostic context issue. We may close it as well once this patch is merged into the master branch.

ChAoSUnItY · 2025-12-09T13:07:33Z

It seems all static test suites fail in stage 2 compilation after the merge conflict resolution and a minor change, which is weird.

DrXiao · 2025-12-09T13:16:39Z

After adding the -strace flag, the output shows that lseek() reports a "Bad file descriptor" error.

$ qemu-arm -strace out/shecc-stage1.elf --no-libc -o test test.c
...
11316 lseek(-9,0,SEEK_CUR) = -1 errno=9 (Bad file descriptor)
11316 lseek(-9,0,SEEK_CUR) = -1 errno=9 (Bad file descriptor)
11316 lseek(-9,0,SEEK_CUR) = -1 errno=9 (Bad file descriptor)
11316 lseek(-9,0,SEEK_CUR) = -1 errno=9 (Bad file descriptor)
11316 lseek(-9,0,SEEK_CUR) = -1 errno=9 (Bad file descriptor)
11316 lseek(-9,0,SEEK_CUR) = -1 errno=9 (Bad file descriptor)

DrXiao

The following snapshots should be removed.

fib-arm.json
fib-riscv.json
hello-arm.json
hello-riscv.json

ChAoSUnItY · 2025-12-18T07:06:13Z

The current fseek and ftell are unable to correctly store the state of a file descriptor. I expect this issue to be resolved outside of this patch; I'll create another issue to track it.

This patch completely separates the preprocessor functionality from the scanner-less parser into independent units, which allows the compiler to expand and parse nested function-like macros, multi-token object-like macros, and more. Furthermore, the error diagnostics are rewritten so that users can more easily find out where an error is and which lexeme caused the compiler to panic.

DrXiao · 2025-12-23T14:16:27Z

src/globals.c

+                if (c == '1')
+                    value += 1;


Rewrite it as value += (c == '1').
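
The suggestion relies on a comparison evaluating to 0 or 1 in C. A minimal sketch in the context of parsing binary digits follows; the surrounding loop is not shown in the diff, so the function `parse_binary_digits` and its loop are assumptions made for illustration.

```c
#include <assert.h>

/* Parse the digits of a binary literal (the part after a "0b" prefix).
 * Stops at the first character that is not '0' or '1'. Input longer than
 * the width of unsigned simply loses its high bits. */
unsigned parse_binary_digits(const char *digits)
{
    assert(digits);
    unsigned value = 0;
    for (const char *p = digits; *p == '0' || *p == '1'; p++) {
        value <<= 1;
        value += (*p == '1'); /* the comparison yields 0 or 1 */
    }
    return value;
}
```

For example, "1011" parses to 11, and the branch-free form produces the same result as the original if statement.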

DrXiao · 2025-12-23T14:35:13Z

src/lexer.c

+    return NULL;
+}
+
+token_stream_t *lex_token_by_file(char *filename)


How about renaming it to gen_file_token_stream() or a similar name?

DrXiao · 2025-12-23T14:35:58Z

src/lexer.c

+    return tks;
 }

+token_stream_t *include_libc()


Similarly, how about gen_libc_token_stream()?

DrXiao · 2025-12-23T14:39:25Z

src/main.c

+
+        return 0;


Remove the blank line. Additionally, would it be more appropriate to use exit(0) to terminate execution, as done at L160?

DrXiao · 2025-12-23T14:41:24Z

lib/c.h

 typedef int *va_list;

+/* Character predicate functions */
+


Remove the blank line.

DrXiao · 2025-12-25T09:51:51Z

src/preprocessor.c

+{
+    tk = pp_lex_skip_space(tk);
+
+    switch (tk->next->kind) {


Is tk->next always valid?

DrXiao · 2025-12-25T13:20:18Z

src/preprocessor.c

+                     &tk->location);
+        }
+
+        tk = pp_get_operator(tk, &op);


Is this line necessary?

DrXiao · 2025-12-25T13:26:57Z

src/preprocessor.c

+
+        if (kind == T_cppd_if || kind == T_cppd_ifdef ||
+            kind == T_cppd_ifndef) {
+            tk = pp_skip_inner_cond_incl(tk->next->next);


Is it possible for tk->next to be NULL?
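
One way to make chained accesses such as tk->next->next safe is to go through small helpers that stop at the end of the stream. The sketch below uses a reduced token layout, and `token_advance`, `peek_next_kind`, and the enumerator values are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical token layout and kinds. */
typedef struct token {
    int kind;
    struct token *next;
} token_t;

enum { T_eof = 0, T_newline = 1, T_identifier = 2 };

/* Return the token `n` steps after tk, or NULL if the stream ends
 * before that. */
token_t *token_advance(token_t *tk, int n)
{
    assert(n >= 0);
    while (tk && n-- > 0)
        tk = tk->next;
    return tk;
}

/* Kind of the token following tk, or T_eof when there is none. */
int peek_next_kind(const token_t *tk)
{
    return (tk && tk->next) ? tk->next->kind : T_eof;
}
```

Another common design is to terminate every stream with an explicit end-of-file token so that next is never NULL before that token is reached; this thread does not say which convention the patch follows.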

DrXiao · 2025-12-25T13:27:05Z

src/preprocessor.c

+        }
+
+        if (kind == T_cppd_endif)
+            return tk->next->next;


DrXiao · 2025-12-25T13:39:55Z

src/preprocessor.c

+                    if (pp_lex_peek_token(tk, T_open_bracket, false)) {
+                        bracket_depth++;
+                    } else if (pp_lex_peek_token(tk, T_close_bracket, false)) {
+                        bracket_depth--;
+                    }


Remove the braces { }.

jserv reviewed Nov 17, 2025 (src/globals.c)

DrXiao reviewed Nov 17, 2025 (src/lexer.c)

DrXiao reviewed Nov 17, 2025 (src/globals.c, src/globals.c)

ChAoSUnItY marked this pull request as ready for review November 22, 2025 17:13

cubic-dev-ai bot reviewed Nov 22, 2025 (src/preprocessor.c, src/globals.c, src/globals.c, .github/workflows/main.yml, src/lexer.c, src/main.c)

ChAoSUnItY force-pushed the feat/preproc branch from 51bd234 to 73552dd November 23, 2025 17:50

ChAoSUnItY requested review from DrXiao and jserv November 23, 2025 17:54

jserv reviewed Nov 24, 2025 (.github/workflows/main.yml)

ChAoSUnItY force-pushed the feat/preproc branch from 73552dd to 49ccd49 November 24, 2025 08:26

DrXiao reviewed Nov 24, 2025

jserv reviewed Nov 24, 2025 (src/defs.h)

ChAoSUnItY force-pushed the feat/preproc branch from 49ccd49 to 3fa250a December 9, 2025 06:33

DrXiao reviewed Dec 9, 2025 (lib/c.c, lib/c.c)

ChAoSUnItY force-pushed the feat/preproc branch from 3fa250a to 8d4198d December 9, 2025 17:28

DrXiao reviewed Dec 10, 2025 (lib/c.c, src/defs.h, .github/workflows/main.yml)

ChAoSUnItY force-pushed the feat/preproc branch 3 times, most recently from 64c71d9 to 3f915cd December 18, 2025 06:48

ChAoSUnItY requested a review from DrXiao December 21, 2025 05:54

DrXiao reviewed Dec 21, 2025 (src/defs.h, src/globals.c, src/lexer.c, src/lexer.c, src/lexer.c)

ChAoSUnItY force-pushed the feat/preproc branch from 3f915cd to b914e5b December 22, 2025 09:00

ChAoSUnItY force-pushed the feat/preproc branch from b914e5b to ee045f8 December 22, 2025 09:03

DrXiao reviewed Dec 25, 2025
